To better understand how prices are set in the sharing economy, a wide range of papers have used hedonic price models to estimate how consumers value Airbnb listings (e.g. Gibbs et al. (2018), Teubner et al. (2017)). In this kind of model, structured attributes of the listing (number of rooms, location, rating, etc.), often together with attributes of the host, are used to identify the sources of consumer utility.
In the following analysis I want to exploit the textual data in the listing descriptions to predict the price of a listing.
Research questions:
Method:
To compare my approach with the conventional methods, I first estimate a model in which I use the structured attributes as exogenous regressors to predict the price of an Airbnb listing. Afterwards, I use textual features of the same listings to predict the prices and compare the two models.
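As a rough illustration of the conventional approach, the sketch below fits a hedonic regression on simulated data. Everything in it is an assumption made for illustration: the toy data frame `toy`, the coefficient values, and the choice of log price as the dependent variable (a common but not universal choice in hedonic models). It does not use the scraped listings.

```r
# Hypothetical toy data: none of these numbers come from the scraped listings.
set.seed(42)
n <- 500
toy <- data.frame(
  bedrooms     = sample(1:3, n, replace = TRUE),
  accommodates = sample(1:6, n, replace = TRUE),
  rating       = round(runif(n, 3, 5), 1)
)

# Assumed "true" implicit prices on the log scale, plus noise
toy$log_price <- 3.5 + 0.20 * toy$bedrooms + 0.05 * toy$accommodates +
  0.10 * toy$rating + rnorm(n, sd = 0.1)

# Hedonic regression: log price explained by structured attributes
fit <- lm(log_price ~ bedrooms + accommodates + rating, data = toy)
round(coef(fit), 2)
```

Each estimated coefficient can be read as the implicit price of one attribute, holding the others fixed. The text-based model is later compared against this kind of baseline.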
The project is divided into three parts. In this section I describe the data set and how I prepare it for analysis. In the second part I estimate a linear model with the conventional attributes and in the third part I use text data for the same listings.
I use a unique dataset that contains information on 47,006 Airbnb listings from seven major German cities, namely Berlin, Munich, Hamburg, Cologne, Dresden, Stuttgart and Frankfurt am Main. Listings were gathered directly from Airbnb’s website in September 2017 using a custom web scraper. In this way I obtained all publicly available information for each listing, including but not limited to prices, accommodation features, reviews and host details.
head(rooms)

## # A tibble: 6 x 62
## room_id host_id room_type country city neighborhood address price
## <int> <int> <chr> <chr> <chr> <chr> <chr> <int>
## 1 19117409 1.34e⁸ Entire ho… Deutsc… Hamb… <NA> Othmarsch… 129
## 2 5728058 3.34e⁵ Entire ho… Deutsc… Hamb… <NA> Neustadt,… 116
## 3 19954984 1.41e⁸ Entire ho… Deutsc… Münc… <NA> Schwabing… 91
## 4 9918551 5.10e⁷ Entire ho… Deutsc… Schö… <NA> Schönefeld 43
## 5 13836114 8.16e⁷ Entire ho… Deutsc… Hamb… <NA> Eimsbütte… 61
## 6 20355318 8.02e⁷ Entire ho… Deutsc… Köln <NA> Köln 49
## # ... with 54 more variables: nightly_price <int>, reviews <int>,
## # accommodates <int>, bathrooms <int>, bedrooms <int>, bed_type <chr>,
## # minstay <int>, last_modified <dttm>, latitude <dbl>, longitude <dbl>,
## # survey_id <int>, location <chr>, coworker_hosted <chr>,
## # extra_host_languages <chr>, name <chr>, property_type <chr>,
## # currency <chr>, rate_type <chr>, overall_satisfaction <chr>,
## # cleanliness_satisfaction <int>, communication_satisfaction <int>,
## # location_satisfaction <int>, accuracy_satisfaction <int>,
## # checkin_satisfaction <int>, value_satisfaction <chr>, amenities <chr>,
## # cancel_policy <chr>, instant_book <chr>, response_time <chr>,
## # response_rate <dbl>, friend_count <int>, wishlist_count <int>,
## # pic_count <chr>, superhost <chr>, description_language <chr>,
## # hostname <chr>, rule_children <chr>, rule_infants <chr>,
## # rule_pets <chr>, rule_smoking <chr>, rule_events <chr>,
## # hostprofilepic <chr>, cleaning_fee <chr>, security_deposit <chr>,
## # last_review <dttm>, positive_reviews <dttm>, negative_reviews <date>,
## # last_cal_update <chr>, member_since <chr>, host_verified <chr>,
## # deleted <chr>, filled <chr>, description <chr>, base_price <chr>
# Convert strings to numeric and drop listings without a rating
rooms <- rooms %>%
  mutate(overall_satisfaction = as.numeric(overall_satisfaction),
         pic_count = as.numeric(pic_count)) %>%
  filter(!is.na(overall_satisfaction))

Keep only listings from the following cities: Hamburg, München, Berlin, Köln, Frankfurt, Dresden, Stuttgart.
## Clean-up helper: if the city name x occurs in the raw string, replace the string with x
create_city <- function(x, city){
  city_clean <- ifelse(grepl(x, city), x, city)
  return(city_clean)
}

city_list <- c("Hamburg","München","Berlin","Frankfurt","Köln","Stuttgart","Dresden")
for(i in city_list){
  rooms$city <- create_city(i, rooms$city)
}
rooms %>%
  filter(city %in% city_list) -> rooms
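To make the effect of the clean-up step concrete, here is a small sketch on made-up city strings. The vector `raw` is hypothetical. The helper and `city_list` mirror the code above.

```r
# Same logic as the helper above: if the pattern x occurs in a city string,
# replace the whole string by x; otherwise leave it untouched.
create_city <- function(x, city) {
  ifelse(grepl(x, city), x, city)
}

# Hypothetical raw values, made up for illustration
raw <- c("Hamburg-Altona", "Frankfurt am Main", "Berlin Mitte", "Potsdam")
city_list <- c("Hamburg", "München", "Berlin", "Frankfurt", "Köln", "Stuttgart", "Dresden")

clean <- raw
for (i in city_list) {
  clean <- create_city(i, clean)
}
clean

# Strings that match none of the seven cities are dropped by the filter
kept <- clean[clean %in% city_list]
kept
```

Because `grepl()` matches substrings, variants such as "Frankfurt am Main" collapse to the canonical name, while places outside the seven cities stay unchanged and are removed by the subsequent filter.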
rooms %>%
  group_by(city) %>%
  tally() %>%
  ggplot(aes(reorder(city, n, desc), n)) +
  geom_col(fill = col[3], alpha = 0.8) +
  labs(x="", y="", title="Count")

rooms %>%
  group_by(property_type) %>%
  tally() %>%
  ggplot(aes(reorder(property_type, n), n)) +
  geom_col(fill = col[3], alpha = 0.8) +
  labs(x="", y="", title="Property Types") +
  coord_flip()

To keep things simple, I keep only listings of property type “Wohnung” (apartment).
rooms %>%
  filter(property_type == "Wohnung") -> rooms

rooms %>%
  ggplot(aes(room_type)) +
  geom_bar(fill = col[3], alpha = 0.8) +
  labs(x="", y="")

rooms %>%
  ggplot(aes(city, price)) +
  geom_boxplot(outlier.size = 0)

Apparently, there are some outliers. After checking the respective listings, I decided to exclude those with a price of 1,500 or more.
rooms %>%
  filter(price < 1500) -> rooms

# Bin prices in steps of 1; everything above 500 falls into one open-ended top bin
rooms$price.cut <- cut(rooms$price, c(seq(0,500,1), Inf))
rooms %>%
  ggplot(aes(as.numeric(price.cut), factor(city))) +
  geom_density_ridges(scale = 5,
                      fill = col[3], alpha = 0.7,
                      color = "white") +
  theme_ridges() +
  scale_x_continuous(expand = c(0, 0), labels = c(seq(0,400,100),">500")) +
  labs(y="", x="Price")

rooms %>%
  ggplot(aes(overall_satisfaction, factor(room_type))) +
  geom_density_ridges(scale = 5,
                      fill = col[3], alpha = 0.7,
                      color = "white") +
  scale_x_continuous(expand = c(0, 0)) +
  labs(y="", x="Rating")

Next, I exclude listings with fewer than three reviews, as it can be assumed that these listings have been booked rarely, if at all.
rooms %>%
  filter(reviews >= 3) -> rooms

# Bin review counts in steps of 1; everything above 50 falls into one open-ended top bin
rooms$reviews.cut <- cut(rooms$reviews, c(seq(0,50,1), Inf))
rooms %>%
  ggplot(aes(as.numeric(reviews.cut), factor(city))) +
  geom_density_ridges(scale = 5,
                      fill = col[3], alpha = 0.7,
                      color = "white") +
  scale_y_discrete(expand = c(0,0)) +
  scale_x_continuous(expand = c(0,0),
                     breaks = c(seq(0,50,10)),
                     labels = c(seq(0,40,10),">50")) +
  labs(y="", x="Number of Reviews")

# Keep the variables needed later and combine title and description into one text field
df <- rooms %>%
  select(room_id, name,
         description, city, price, overall_satisfaction,
         room_type, bed_type, pic_count,
         reviews, accommodates, bedrooms, minstay,
         latitude, longitude) %>%
  mutate(fulltext = paste(name, description, sep=" "))

Turning to the text data, let's first have a quick look at three random descriptions:
rooms %>% sample_n(3) %>%
  select(description) %>%
  knitr::kable(align = "l")

| description |
|---|
| 15-qm einfaches Zimmer in uriger Altbau-Wohnung. Kein Luxus aber gemütlich und freundlich. Informelle und lockere Atmosphäre. Schöne Gegend. Westzentrum 15 min zu Fuß. 300 m bis zum Zentralomnibusbahnhof (ZOB). Aufgrund einer Auflage von der Mutter meiner 2 Teenager-Töchter, die während der Woche 3-4 Tage zu Hause sind, kann ich leider keine unbekannte Männer in der Wohnung unterbringen. Auf Anfrage kann im Zimmer eine 2. weibliche Person bzw. Kind aufgenommen werden. |
| Beautiful Appartment at Arabellapark in East Munich Big Living Room, seperate Kitchen, two bedrooms, one with a King Size Bed for 2 People one room with 2 Single Beds. A Terrasse and a little garden come with the Appartment It’s with U4 (Underground) 12 minutes to the city center and the Thereseinwiese, where Oktoberfest takes place. Big Supermarkets, parcs nearby. 15 Minutes by foot to the beautiful Isar and English Garden. A good, spacious and homing place to experience Munich. |
| Meine schöne Altbau wohnung liegt superzentral in einem der gefragtesten Stadtteile von Hamburg / in St. Georg . Es ist hochwertig möbliert und modern gestylt. Direkt vor der Haustür befindet sich die Alster. ca 2 min Fussweg. Ihr könnt entlang der Alster vorbei in die Innenstadt laufen. Der Hauptbahnhof, sowie die Kunsthalle, Einkaufsmöglichkeiten, Restaurants, Bars, Clubs findet man direkt vor der Haustür. |
In which languages are the descriptions written?
load(file = "../output/prep1.Rda")

df %>% group_by(language) %>%
  tally() %>%
  ggplot(aes(reorder(language, n), n)) +
  geom_col(fill = col[3], alpha = 0.7) +
  coord_flip() +
  labs(x="",y="")

Check a sample of listings to see whether the classification is valid:
df %>%
  sample_n(5) %>%
  select(fulltext, language) %>%
  knitr::kable()

| fulltext | language |
|---|---|
| Ruhige Unterkunft in Villengegend Sonniger Neubau in den Elbvororten. Nur 600m zu S Bahn und Bussen. Nahe der Internationalen Schule, der Elbe und verschiedenen Parks. Parkplatz vor der Tür. | german |
| Kl., ruhiges Zimmer im Glockenbach/Oktoberfest Gemütliche Wohnung mit großem hellen Wohnzimmer und kleinem, ruhigem Schlafzimmer zum Hinterhof bietet Platz für 2 Leute (160cm Bett). Cosy Room in lovely apartment close to scenic Glockenbachviertel, 10min to City Center by feet and 5min to UBahn. You can walk to the Oktoberfest within 10minutes. | german |
| Tolle Wohnung im Schanzenviertel Lichtdurchflutete Neubauwohnung mit bodentiefen Fenstern und Holzfußboden. Bestens geeignet für Familien mit mehreren Kindern oder Freunde bis 7 Personen Absolut zentral gelegen. 5 Minuten bis ins Schanzenviertel. | german |
| Große, helle 88qm Altbauwohnung Zentral gelegene 3-Zimmer Altbauwohnung mit guter Ausstattung in Moabit. | german |
| Ruhig, hell, für 1 Person oder Paar 40m2 - Wohnung, 4.Etage, Seitenflügel abseits der Straße, ruhige Nachbarn. Radio, TV+DVD, Kühlschrank, Wasserkocher, Kaffeemaschine, Geschirr,Spülmaschine, WLAN, Dusche, Waschmaschine, 1.40 breites Bett+ 2 Decken (2.Person gratis),Handtücher, Balkon. | german |
OK, looks good. Let's keep only listings with German or English descriptions.
df %>%
  filter(language %in% c("german","english")) -> df

ggplot(df, aes(x=factor(city))) +
  geom_bar(aes(fill = language),
           alpha = 0.8) +
  labs(x="", y="", fill="")

It is not surprising that Berlin seems to be the most international city, measured by the number of listings with an English description. But I am a little disappointed with Hamburg…
How long are the descriptions on average?
# Word count: number of whitespace-separated tokens per text
df$text_length <- sapply(gregexpr("\\S+", df$fulltext), length)

df$text_length.cut <- cut(df$text_length, c(seq(0,150,1),Inf))
df %>%
  ggplot(aes(as.numeric(text_length.cut), factor(city))) +
  geom_density_ridges(aes(fill = language),
                      color = "white", alpha = 0.8) +
  scale_x_continuous(expand = c(0,0),
                     labels = c(seq(0,100,50),">150")) +
  labs(y = "", x = "Word Count", fill= "") +
  theme()

Surprisingly, the English texts are longer.
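A side note on the word count used above: `gregexpr()` returns `-1` when nothing matches, so taking the length of its result counts an empty text as one word. The sketch below contrasts that with a variant that extracts the matches first. The example strings are made up, and whether empty texts occur in the data at all is not something the analysis above tells us.

```r
# Hypothetical example strings
x <- c("Helle Wohnung in Moabit", "  cosy   room ", "")

# Approach used above: length of the gregexpr() result per element.
# gregexpr() returns -1 when there is no match, so "" is counted as 1 word.
n_naive <- sapply(gregexpr("\\S+", x), length)

# Variant: extract the matches first, then count them
n_words <- lengths(regmatches(x, gregexpr("\\S+", x)))

data.frame(text = x, n_naive, n_words)
```

For non-empty texts both versions agree, so the density plot is unlikely to change noticeably. The difference only matters for listings whose combined title and description is blank.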
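Next, I have to pre-process the text data before I can include it in my model. Text data is inherently high-dimensional, so to reduce this dimensionality I apply the following steps: replace punctuation, control characters and digits with spaces, trim leading and trailing whitespace, convert everything to lower case, remove English and German stop words, and finally split the text into single words and bigrams.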
df$text_cleaned <- gsub("[[:punct:]]", " ", df$fulltext)
df$text_cleaned <- gsub("[[:cntrl:]]", " ", df$text_cleaned)
df$text_cleaned <- gsub("[[:digit:]]", " ", df$text_cleaned)
# Trim leading and trailing whitespace
df$text_cleaned <- gsub("^[[:space:]]+", "", df$text_cleaned)
df$text_cleaned <- gsub("[[:space:]]+$", "", df$text_cleaned)
df$text_cleaned <- tolower(df$text_cleaned)

# removeWords() comes from the tm package
df$text_cleaned <- removeWords(df$text_cleaned, stopwords("english"))
df$text_cleaned <- removeWords(df$text_cleaned, stopwords("german"))

# One row per word; drop single characters and implausibly long tokens
token.df <- df %>%
  tidytext::unnest_tokens(word, text_cleaned) %>%
  filter(nchar(word) > 1) %>%
  filter(nchar(word) < 30)
token.df %>%
  count(word, sort = TRUE) %>%
  ungroup() %>%
  top_n(20, n) %>%
  knitr::kable(align="l")

| word | n |
|---|---|
| wohnung | 12264 |
| apartment | 9732 |
| zimmer | 8800 |
| room | 8529 |
| min | 8365 |
| berlin | 5994 |
| bahn | 5187 |
| restaurants | 4511 |
| minuten | 4289 |
| flat | 4200 |
| küche | 3877 |
| city | 3862 |
| nähe | 3800 |
| unterkunft | 3488 |
| bars | 3228 |
| qm | 3060 |
| direkt | 2992 |
| liegt | 2983 |
| station | 2955 |
| lage | 2916 |
# One row per bigram (pair of consecutive words)
bigram.df <- df %>%
  unnest_tokens(bigram, text_cleaned,
                token = "ngrams", n=2)
bigram.df %>%
  count(bigram, sort = TRUE) %>%
  ungroup() %>%
  top_n(20, n) %>%
  knitr::kable(align="l")

| bigram | n |
|---|---|
| u bahn | 2699 |
| s bahn | 1870 |
| zimmer wohnung | 1497 |
| wohnung liegt | 1287 |
| prenzlauer berg | 1083 |
| living room | 1081 |
| city center | 989 |
| walking distance | 982 |
| unterkunft gut | 936 |
| bars restaurants | 891 |
| paare alleinreisende | 848 |
| gut paare | 832 |
| unterkunft nähe | 811 |
| restaurants bars | 786 |
| alleinreisende abenteurer | 771 |
| wohnung befindet | 751 |
| unmittelbarer nähe | 745 |
| unterkunft lieben | 733 |
| st pauli | 689 |
| lieben wegen | 678 |
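One way token counts like those above could later enter a price model is as a document-term matrix, with one row per listing and one column per term. The sketch below builds such a matrix in base R for a made-up mini-corpus. The object names (`docs`, `long`, `dtm`) are hypothetical, and for the full data a sparse representation from a text-mining package would likely be the more practical choice.

```r
# Hypothetical mini-corpus of already cleaned descriptions
docs <- c(d1 = "wohnung liegt zentral",
          d2 = "zentral wohnung wohnung",
          d3 = "room city center")

# Split on whitespace and record which document each token came from
tokens <- strsplit(docs, "\\s+")
long <- data.frame(
  doc  = rep(names(tokens), lengths(tokens)),
  word = unlist(tokens, use.names = FALSE)
)

# Cross-tabulate documents against words:
# one row per listing, one column per term, cells hold term counts
dtm <- table(long$doc, long$word)
dtm
```

Each column of `dtm` can then serve as a regressor, which is where the high dimensionality mentioned earlier comes from: the number of columns grows with the vocabulary.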
corp <- corpus(df$text_cleaned)
docvars(corp) <- df$city # attach the city labels to the corpus as a document variable
col <- RColorBrewer::brewer.pal(10, "BrBG")

# Word cloud per city; min.freq is lower for cities with fewer listings
c.plot <- corpus_subset(corp, docvar1=="Berlin")
c.plot <- dfm(c.plot, tolower = TRUE, remove_numbers = TRUE, remove=stopwords("SMART"))
textplot_wordcloud(c.plot, min.freq = 250, color = col)

c.plot <- corpus_subset(corp, docvar1=="Hamburg")
c.plot <- dfm(c.plot, tolower = TRUE, remove_numbers = TRUE, remove=stopwords("SMART"))
textplot_wordcloud(c.plot, min.freq = 200, color = col)

c.plot <- corpus_subset(corp, docvar1=="München")
c.plot <- dfm(c.plot, tolower = TRUE, remove_numbers = TRUE, remove=stopwords("SMART"))
textplot_wordcloud(c.plot, min.freq = 50, color = col)

c.plot <- corpus_subset(corp, docvar1=="Köln")
c.plot <- dfm(c.plot, tolower = TRUE, remove_numbers = TRUE, remove=stopwords("SMART"))
textplot_wordcloud(c.plot, min.freq = 50, color = col)

c.plot <- corpus_subset(corp, docvar1=="Frankfurt")
c.plot <- dfm(c.plot, tolower = TRUE, remove_numbers = TRUE, remove=stopwords("SMART"))
textplot_wordcloud(c.plot, min.freq = 50, color = col)

c.plot <- corpus_subset(corp, docvar1=="Stuttgart")
c.plot <- dfm(c.plot, tolower = TRUE, remove_numbers = TRUE, remove=stopwords("SMART"))
textplot_wordcloud(c.plot, min.freq = 50, color = col)

c.plot <- corpus_subset(corp, docvar1=="Dresden")
c.plot <- dfm(c.plot, tolower = TRUE, remove_numbers = TRUE, remove=stopwords("SMART"))
textplot_wordcloud(c.plot, min.freq = 50, color = col)